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ABSTRACT In this paper, we introduce an approach to the protein fold- 
ing problem from the point of view of statistical physics. Protein folding is 
a stochastic process by which a polypeptide folds into its characteristic and 
functional 3D structure from random coil. The process involves an intricate 
interplay between global geometry and local structure, and each protein seems 
to present special problems. The first part of this chapter contains a concise 
discussion on kinetics versus thermodynamics in protein folding, and introduce 
the statistical physics basis of protein folding. In the second part, we introduce 
CSAW (conditioned self-avoiding walk), a model of protein folding that com- 
bines the features of self-avoiding walk (SAW) and the Monte Carlo method. In 
this model, the unfolded protein chain is treated as a random coil described by 
SAW. Folding is induced by hydrophobic forces and other interactions, such as 
hydrogen bonding, which can be taken into account by imposing conditions on 
SAW. Conceptually, the mathematical basis is a generalized Langevin equation. 
Despite the simplicity, the model provides clues to study the universal aspects 
while we overlook details and concentrate only on a few general properties. To 
illustrate the flexibility and capabilities of the model, we consider several ex- 
amples, including helix formation, clastic properties, and the transition in the 
folding of myoglobin. From the CSAW simulation and physical arguments, we 
find a universal elastic energy for proteins, which depends only on the radius of 
gyration R g and the residue number N. The elastic energy gives rise to scaling 
laws R g ~ N u in different regions with exponents v = 3/5,3/7, 2/5, consistent 
with the observed unfolded stage, pre-globule, and molten globule, respectively. 
These results indicate that CSAW can serve as a theoretical laboratory to study 
universal principles in protein folding. 



1 Introduction 



In spite of considerable efforts in the past decades, the protein folding problem 
is still unsolved and remains one of the most basic intellectual challenges in 
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The principle through which the amino acid sequence determines the native 
structure, as well as the dynamics of the process, remain open questions. 

In the past 36 years, thinking in the protein folding field has being strongly 
influenced by Christian Anfinsen's "thermodynamic hypothesis" . This hypoth- 
esis states that the native state of a protein is the one with the lowest Gibbs 
free energy, and is determined by the totality of interatomic interactions of the 
amino-acid sequence, in a given environment (Anfinsen 1973). Predicting the 



3D structure of a protein from its amino-acid sequence is one of the most impor- 



tant goals in the protein folding problem (Moult et al. 1995 Zhang & Skolnick 



2005 Zhang 2008). 



Despite this widely established thermodynamic hypothesis (Dill 1990 Kim 



\k. Baldwin] |1990[ ), recent experiments have suggested that there may be excep 



tions, especially for larger and complex proteins (Baker & Agard 1994 Baker 



19981 Lazaridis & Karplus 2003 Shakhnovich 2006). For these proteins, the 



native conformations correspond to the kinetically most accessible state (kinetic 



control), instead of the most stable one (thermodynamic control) (Lazaridis & 



Karplus 20031. This would be the case when the barriers between conforma- 



tions are too high to be ergodic kinetically, and the protein ends up at a state 
of local minimum energy that is accessible in available time. 

Molecular dynamics (MD) is an invaluable tool with which to study protein 
folding dynamics in vitro (Day & Daggett 2003 Scheraga et al. 2007). MD 



solves the Newtonian equations of the motion of all atoms in a protein on a 
computer, using appropriate interatomic potentials. To describe the solvent, 
one includes thousands of water molecules explicitly, treating all atoms in the 
water on the same footing as those on the protein chain. For the integration 
algorithm to be stable, the time step At must be an order of magnitude smaller 
than the fastest motions of the system, typically the vibration of a covalent 
bond, whose period is the order of 10 fs. Thus, the time step is of the order 
of 1 fs (McCammon et al. 1977 Pearlman et al. 1995). Not surprisingly, 



such an extravagant use of computing power is so inefficient that one can follow 
the folding process only to about a microsecond ( Duan & Kollman 1998[ |Jang 



et al. 2003 ), whereas the folding of a real protein takes from seconds to minutes. 



The use of reduced models (meso-scopic, coarse-grained, implicit-solvent) with 
physically based potentials is a reasonable trade-off between computational cost 



and accuracy (Levitt 1976 Kazmierkiewicz et al. 2002 2003| Nielsen et al. 



2004 Huang 2007 Sun 2007). 



Proteins with different amino acid sequences can invoke quite different fold- 
ing mechanisms (Daggett & Fersht 20031, while proteins with high sequence 
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similarity can end up with very different folds ( Alexander et al 
[ 2005). Nevertheless, universal aspects do emerge, if one overlooks details and 
concentrates only on a few general properties. These will be the main subject 
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in this chapter. 

We are concerned here with a perspective of protein folding from the point 



of view of statistical physics (Huang 2005). After all, the protein is a chain 



molecule immersed in water, and, like all physical systems, will tend towards 
thermodynamic equilibrium with the environment. The process is stochastic, 
involving an intricate interplay between global geometry and local structure. 
Our goal is to design a model amenable to computer simulation in a reasonable 
time, and to investigate the physical principles of protein folding, in particular 
the relative importance of various interactions. 

We treat the protein as a molecular chain performing Brownian motion in 
water, regarded as a medium exerting random forces on the chain, with the 
concomitant energy dissipation. In addition, we include regular (non-random) 
interactions within the chain, as well as between the chain and the medium. 
Protein folding is a stochastic process of conformation changes, and must be 
analyzed in terms of a statistical ensemble, but not of a single pathway. In 
this perspective, we study protein folding by defining a transition probabilities 
between the conformations, which are described through a Monte Carlo type 
model. 

The unfolded chain is assumed to be a random coil described by SAW (self- 



avoiding walk), as suggested by Flory some time ago (Flory 1953). That is, 
links in the chain correspond to successive random steps, in which the chain is 
prohibited from revisiting an occupied position. 

We model the protein chain in 3D space, keeping only degrees of freedom 
relevant to folding, which we take to be the torsional angles between successive 
links. In a computer simulation, we first generate an ensemble of SAWs, and 
then choose a sub-ensemble through a Monte Carlo method, which generates 
a canonical ensemble with respect to a Hamiltonian that specifies the inter- 



actions. We call the model CSAW (conditioned self-avoiding walk) (Huang 



2007 1 . Mathematically speaking, it is based on a Langevin equation describing 
the Brownian motion of a chain with interactions. There seems little doubt 
that such an equation does describe a protein molecule in water, for it is just 
Newton's equation with the environment treated as a stochastic medium. 
Two types of interactions are included in our initial formulation: 

• the hydrophobic interaction with the medium, which causes the chain to 
fold; 

• hydrogen bonding within the chain, which leads to secondary structures. 

The model can be implemented efficiently, and is flexible enough to be used as a 
theoretical laboratory. In our initial studies, we keep only the two interactions 
listed above, in order to describe the folding dynamics qualitatively. Other 
interactions, such as electrostatic and van der Waals interactions, can be add 
as refinements. 

Both CSAW and MD are based on Newtonian mechanics, and differ only 
in the idealization of the system. In CSAW we replace the thousands of water 
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molecules used in MD by a stochastic medium — the heat reservoir of statisti- 
cal mechanics. With this simplification, we consider the main interactions — 
hydrophobic interaction and heat dissipation, and ignore inessential degrees of 
freedom, such as small vibration in the length and angles of the chemical bonds 
in the protein chain. The advantages of this idealization are that we 

• avoid squandering computer power on irrelevant calculations; 

• gain a better physical understanding of the folding process. 

After a brief review of the basics of protein folding and stochastic processes, 
we shall describe the model in more detail, and illustrate its use through exam- 
ples involving realistic protein fragments. We will demonstrate helix formation, 
clastic properties, and folding stages. 

Our results indicate that the CSAW model can describe qualitative features 
in the folding of simple proteins, and provide physical insight on the mechanisms 
of protein folding. 



2 Protein basics 
2.1 The protein chain 

The protein is a polypeptide chain consists of a sequence of units or "residues" , 
which are amino acids chosen from a pool of 20. This sequence is called the 
primary structure. Along the protein chain, the amino acids are connected by 
peptide bonds between the C atom of one residue and the nitrogen atom of the 
next. The center of each amino acid is a carbon called C a . The amino acids 
differ from each other only in the side chains connected to the C a 's. The side 
chains can be grouped to different classes by their hydrophobicity, charge, and 
polarity (iBranden & Toozel 119981). 



A polypeptide chain can also be viewed as repeating peptide units that 
connect one C a atom to the next one along the backbone (Figure [l]). All atoms 
in such a unit are fixed in a plane, the peptide plane, with the bond lengths and 
bond angles very nearly the same in all units in all proteins (Table [TJ . 

The relative orientation of successive units is determined by the two torsion 
angles (aka dihedral angle) (j) and i/j, as schematically illustrated in Figure [I] 
The torsion angles and ip are rotation angles about the N-C Q and C Q -C 
bonds, respectively, the positive sense being defined according to the right-hand 
rule. By definition, the flat chain, with all backbone atoms lying in a plane, 
corresponds to <fi — ip — 180° for all residues. 

When small vibrations in bond lengths and bond angles are neglected, the 
torsion angles are the only degrees of freedom. For our purpose, therefore, a 
protein of N residues has 2(A^ — 1) degrees of freedom, i.e., the angle pairs 
(</>„, tp n )n=ii where n labels the residuaM The angle pair ((t> n ,ip n ) can take 
values only from sterically allowed regions in the "Ramachandran plot" . 

1 The angles <j>l and ipN are n °t relevant 
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residue 



Figure 1: Schematic representation of the protein chain. The center of a residue 
is a carbon atom labeled a. The residues are connected by the peptide bonds 
between the C atom of one residue and the nitrogen atom of the next. All 
atoms connected to the peptide bond lie on a plane, the peptide plane. The 
only degrees of freedom we consider arc the torsion angles </>, tp that specify the 
relative orientation of successive peptide planes. Residues can differ only in the 
side chains labeled R, chosen from a pool of twenty. 



2.2 Secondary structures 

The natural proteins have evolved through natural selection to perform specific 
biological functions, which depend on their 3D structures or conformations. In 
general, the 3D structure in the native state contains one or more domains, each 
of which is made up of ordered secondary structures, such as alpha (a) helices 
or beta sheets, and disordered loops. When the secondary structures are 
blurred over, one sees a global shape called the tertiary structure. 

The alpha helix is the most common form of secondary structure in proteins. 
It consists of a stretch of consecutive residues all having torsion angle pairs 
(<f>,ip) « (~60°, -50°) ( |Pauring et aT] |1951| ). The helix has 3.6 residues per 



turn. A hydrogen bond connects the (CO) in residue n to the (NH) in residue 
n + 4. There are variations in which residue n is bonded to n + 3 or n + 5, 
instead of n + 4. 

The beta sheet is another form of secondary structure, in which different 
beta strands are connected laterally by five or more hydrogen bonds, forming 
a twisted pleated sheet. Here, a beta strand refers to a stretch of amino acids 
whose peptide backbones are almost fully extended, with (<f>, ip) s» (—135°, 135°). 
In the Ramachandran plot this corresponds to a broad region in the upper left 
quadrant. 



2.3 Hydrogen bonds 

A hydrogen bond (H-bond) refers to the sharing of the H atom between the 
groups (N-H) and (0=C). It is formed when (a) the distance between H and 
O fall within certain limits, and (b) the chemical bonds in the two groups are 
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antiparallel, within a tolerance. The bond is partly of electrostatic nature, and 
partly covalent. It is weaker than the usual covalent or ionic bond, but stronger 
than the van der Waals interaction. 

The number of hydrogen bonds in a typical protein is very large, and they 
are important for the mechanical stability of a conformation. The (NH) and 
(CO) groups on the backbone tend to form hydrogen bonds with each another. 
Hydrophilic side chains can also form hydrogen bonds. They are usually located 
on the protein surface, and bond with the water solvent. 



2.4 Hydrophobic effect 

In the water molecule H2O, all the atoms can form hydrogen bonds with another 
water molecule. Thus, water in bulk consists of a network of hydrogen bonds. 
In ice, they form a 3D lattice in which each H2O has 4 nearest neighbors. 

In the medium of protein folding, liquid water, there is a fluctuating network 
of H-bonds with an average life time of 9. 5ps. hydrophilic That is, the bonding 



partners change on a time scale of lOps = 10 sec ( jGarrett fc Crisham 1999) 



A foreign molecule introduced into water would disrupt the network, unless it 
can participate in hydrogen bonding. If it can form H-bond with water, it is 
said to be "soluble" , or "hydrophilic" , and will be received by water molecules 
as one of their kind. Otherwise it is unwelcome, and said to be "insoluble" , or 
"hydrophobic" . Protein side chains can be hydrophobic or hydrophilic^] but the 
(CO) and (NH) groups on the backbone are hydrophilic. 

When immerse in water, the protein chain rearranges its conformation in or- 
der to shield the hydrophobic residues from water. In effect, the water networks 
squeezes the protein into shape. This is called the "hydrophobic effect". A 
"frustration" arises in the rearrangement process, when burying a hydrophobic 
side chain drags part of the backbone into the interior of the protein. Since the 
backbone is always hydrophilic, this robs it of the chance to bond with water. 
The frustration is resolved by the formation of secondary structures, which "use 



up" the hydrogen bonds internally (Branden & Tooze 1998). The folded chain 



reverts to a random coil when the temperature becomes too high, or when the 
pH of the solution becomes acidic. 



2.5 Statistical nature of the folding process 

We have to distinguish between protein assembly inside a living cell (in vivo) and 
folding in a test tube (in vitro). The former process takes place within factory 
molecules called ribosomes, and generally needs the assistance of "chaperon" 
molecules to prevent premature folding. In the latter, the molecules freely fold 
or unfold, reversibly, depending on the pH and the temperature. 

2 The hydrophobic side chains include: Ala (A), Val (V), Leu (L), He (I), Phc (F), Pro (P), 
Met (M), Trp (W). The hydrophilic side chains include: Asp (D), Glu ( E), Lys (K), Arg (R) 
Ser (S ), Thr (T), Cys (C), Asn (N), Gin (Q), His (H), Tyr (Y), Gly (G) ( |Garrett fc Crisham| 
[1999] ). 
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We deal only with folding in vitro, in which many protein molecules undergo 
the folding process independently and they do not fold in unison. We are thus 
dealing with an ensemble of protein molecules, in which definite fractions exist in 
various stages of folding at any given time. The Langevin equation naturally de- 
scribes the time evolution of such an ensemble. Behavior of individual molecules 
fluctuate from the average, even after the ensemble has reached equilibrium. In 
a macroscopic body containing the order of 10 23 atoms, such fluctuations are 
unobservable small. For a protein with no more than a few thousand atoms, 
however, these fluctuations are pronounced. 

To understand the physical basis of protein folding, we should analyze the 
time evolution of the ensemble from the perspective of statistical physics, instead 
of the evolution of a single pathway. 



2.6 Folding stages 



A typical folding process consists of a fast initial collapse into an intermediate 
state call the "molten globule" . The latter takes a relatively long time to un- 
dergo fine adjustments to reach the native state. The collapse time is generally 



less than 300/is (Akiyama et al. 2003 Uzawa et al. 2004 Kimura et al. 



2005J, while the molten globule can last as long as 10 minutes. In some proteins 



there is evidence for a pre-globule stage (Uversky & Ptitsyn 1996 Uversky 



2002). 



In the unfolded state, a protein chain is extended and flexible, whereas it 
is globular and compact in the folded, native state. The molten globule is as 
compact as the folded state, and possesses most of the secondary structures. 
The pre-globule is unstructured and less compact. 

An overall characteristic of the protein structure is the radius of gyration R g , 
the root-mean-square separation between residues, which can be measured by 



small angle X-ray scattering (Chu 1974 Berne & Pecora 1976 Doi & Edwards! 



19861. Stages in the folding process can be characterized by scaling relations of 
N v . The unfolded protein chain, which is akin to a homopolymer, 



the form R„ 



has index v — 3/5 according to Flory's mode l (|Flory||1953[). The observed index 
of pre-globule is v = 0.411 ± 0.016 3/7 (|Uversky| |2002| ) The native state 
proteins has a smaller index v = 2/5 ( |Arteca 1994 1995 1996 Hong & Lei 



2009). 



To anticipate results described in more detail later, we propose the fold- 
ing stages depicted in Figure [2j with the indicated time scales and the scaling 



relations R a 



3 Physical basics 



3.1 Stochastic process 



Protein folding is a stochastic process ( Michalet et al. 2006 Dill et al. 2008 ) 



involving random forces ever present in the formation and destruction of hy- 
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Figure 2: Suggested stages in protein folding based on physical arguments and 
computer simulations to be discussed later. Each stages is characterized by 
a scaling law of the form R g ~ N u , where R g is the radius of gyration and 
N is the number of residues. While the existence of these stages should be 
a universal property of proteins, the visibility of the pre-globule stage is not 
universal, because its lifetime depends on details of short-ranged interactions. 
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drogen bonds, thermal fluctuations in covalent bonds and bond angles, etc. An 
individual protein molecule does not have a fixed conformation, even in the 
equilibrium state; it is characterized by a probability distribution about average 
conformations. 

The folding process is described by the time evolution of a probability dis- 
tribution P(X,t), where X denotes a conformation (or configuration). We can 
model it through the master equation ( van Kampen| [1992 ) 



dP{X,t) 
dt 



= J{W{X\X',t)P{X',t)-W{X'\X,t)P{X,t)}dX' (1) 



Here W(X \X' , t) is the transition probability, at time t, per unit time, from the 
conformation X' to X. 

For simplicity, we assume that protein folding is a Markov process, which 
means the transition probability depends only on its present state, and not on 
any past state. That is, there is no memory of the previous history. The transi- 
tion probability therefore is independent of the time t, and rewritten W(X\X'). 
This transition probability is all we need to describe the folding process. The 
equilibrium state is characterized by the probability distribution P(X) satisfy- 
ing 

J {W{X\X')P{X') - W{X'\X)P{X)}dX' = (2) 

for all conformations X. 

The Markov assumption is the simplest one can make, and it leads to the 
well-established Gaussian distribution, the "bell curve" applicable in many sta- 
tistical problems. In the macroscopic world, there is of course a persistence of 
memory. The only question is how important it is. The study of non-Markovian 
processes leads to the field of "complexity" , upon which we will not tread. 

3.2 Brownian motion 

In the Brownian motion of a single particle suspended in a medium, one can 
physically verify that the Markov hypothesis is justified. The transition proba- 
bility can be derived analytically as follows. 

Consider ID Brownian motion in a potential well U(x). The position x(t) 
is a stochastic variable described by the Langevin equation (Huang 2009[ ) 



mx = F(t)-jx-U'(x), (3) 

where U'(x) = dU{x)/dx. The force on the particle by the medium is split into 
two parts: a random fluctuation F(t) and a damping force —jx. The random 
force is a member of a statistical ensemble with the properties 

(F(t)F(t>)) = c S(t - t>) W 
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where the brackets ( ) denote ensemble average. The two forces are not inde- 
pendent, but related through the fluctuation-dissipation theorem: 



Co 

2t 



= k R T 



(5) 



where ks = 1-38 x 10~ 23 J/K is Boltzmann's constant, and T is the absolute 
temperature. 

In an overdamped medium, we can omit the particle acceleration, and rewrite 
([3| as a stochastic differential equation 

1 x = F{t)-U\x). (6) 

In a coarse time scale, the transition Ax = x' — x in At, starting from x, satisfies 



{AX k = -I'' U\x), =2D + (j-'U\x)fAt, 



At 



(7) 



where D = IcbT/j is the diffusion constant by Einstein's relation. 

Let W(Ax, At; x) to be the probability of a transition from x to x' = x + Ax 
in At, starting from x. From |7| we can deduce 



W{Ax, At;x) 



1 



V4tt D At 



exp 



(Ax + j^U'ix))' 
4D At 



(8) 



This Gaussian distribution verifies the central limit theorem, according to which 
the probability of the sum of a large number of any independent random vari- 
ables is Gaussian (Huang 2009). 

From Q, the transition probability is obtained by setting At = 1: 



W{x'\x) = 



exp 



(Ax + j-'U'jx))' 
AD 



(9) 



This enables us to write down the equation for the probability distribution 
P{x,t). When the potential is absent, the equation can be solved exactly, and 
also simulated by random walk. Both methods lead to di ffusion, in which th e 
position has a Gaussian distribution with variance \/2Dt (van Kampen 1992). 

When there is an external force, we may not be able to solve the equation 
analytically, but we can still simulate it on a computer by conditioned random 
walk as discussed below. 



3.3 Conditioned random walk 



In the conditioned random walk, we first generate a trial step by random walk, 
but accept it with a certain probability. To obtain the probability of a transition 
Aa; over a time interval At, we integrate W{Ax, At; x) over At, and get 



W{Ax;x) 



W{Ax, At; x)dAt = cexp 



AU(x) 
2L> 7 



1 



AU{x) 
\AU{x) 



(10) 
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Here AU{x) « U'{x) Ax, and c = |l/(-7 _1 C/'(cc))| = \At/(Ax) x \. The exponen- 



tial in (10) gives following Metropolis criteria: 

• if AU < 0, accept it; 

• if AU > 0, accept it with probability exp(— AU/ksT). 

The last condition allows for acceptance even if the energy increases, and this 
simulates thermal fluctuations. 

For simplicity, we have illustrated the method in the overdamped approxi- 
mation; but the results hold in general. 

We can also integrate the equation (|3| directly as a stochastic differential 
equations, as an alternative to conditioned random walk. The equivalence of 



these two methods is illustrated by an example in the appendix of Lei & Huang 
([20081). 



4 CSAW model 
4.1 Model description 

In protein folding, we are dealing with the Brownian motion of a non-overlapping 
chain with interactions. To follow the time development, we begin with an 
unfolded chain modeled by self-avoiding walk (SAW), and take interactions into 
account through conditions imposed on updates. The resulting model is called 
CSAW (conditioned self-avoid walk). 

We can generate a SAW representing the unfolded protein chain by the pivot 
algorithm, as follows ( |Li et al~\ |1995| Kennedy. 2002). Choose an initial self- 



avoiding walk in 3D continuous space, and hold one end of the chain hxed. 

1. Choose an arbitrary point on the chain as pivot. 

2. Rotate the end portion of the chain rigidly about the pivot (by changing 
the torsional angles at the pivot point). 

3. If this does not result in any overlap, accept the conformation; otherwise 
repeat the procedure. 

By the method, we can generate a uniform ergodic ensemble of SAW's, which 
simulates a Langevin equation of the form 

m fc x fe = F fc (i)- 7fc x fc +V fe (x), (k = l,---,N) (11) 

where the subscripts k labels the residues along the chain. The terms denote 
the regular (non-random) forces that maintain the rigid bonds between succes- 
sive residues, the bond angles, and that prohibit the residues from overlapping 
one another. 
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We now add other regular forces G& = — V Xfc -E(x), where E is the potential 
to be detail latter. Analogy to the above discussion, it gives an acceptance 
probability 

u;ya ^ / !. ifA£<0 , 10 , 

W(Ax; x) = | cM _ AE/2kBT) iiAE>0 (12) 

of a random walk from x to x' = x + Ax. Treating this force via the Metropo- 
lis method results in CSAW, which simulates a generalized Langevin equation 
indicated in the following: 

m fc x fe = F fe (t) - 7fc Xfc + Vfe(x) - V Xfc £(x) (13) 

Treat via SAW Treat via Metropolis 

Now we shall specify the potential explicitly. 
4.2 Implementation of CSAW 

To reiterate, the system under consideration is a sequences of residues (NH- 
Co,i?H-CO) connected by peptides, as shown in Figure]!] The residues can differ 
from one another only through the side chains R attached to the C a atom, and 
there are 20 of them to choose from. There are O and H atoms attached to 
each residue. The atoms between two successive C a atoms are fixed in a plane, 
with fixed bond lengths and bond angles as given in Table. [T] The degrees of 
freedom of the system are the pairs of torsional angles (4> n ,ip n ) specifying the 
relative orientation of two successive peptide planes. 

Table 1: Bond lengths and bond angles used in the CSAW model. 



Bond 


Value (A) 


Angle 


Value (°)( c > 


C Q -N 


1.46 


N-C Q -C 


111.0 


C a -C 


1.51 


i?-C a -C 


101.1 


G a -R 


1.53^ 


R-C a -N 


109.6 


C Q -H 


1.00 


H-C a -C 


101.1 


C-N 


1.33 


C Q -C-N 


114.0 


N-H 


1.00 


C-N-C Q 


123.0 


C-0 


1.24 


H-N-C 


123.0 


N-CH 2 


1.48W 


H-N-C a 


114.0 






O-C-N 


125.0 






0-C-C a 


121.0 



Replaced by C a -H for glycine. 
W The bond length N-CH 2 at proline. 
( c ) Same values for glycine and proline. 



For simplicity we treat all atoms, as well as the side chains, as hard spheres. 
More realistic representation can be implement if desired. The hard sphere 
sizes refer to van der Waals radii (Table [2]) . Two atoms (not neighbors along 
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the chain) are "in contact" if their centers are separated by a distance less than 
the summation of their van der Waals radii. They are regarded as overlapping if 
the distance is less than a factor er(< 1) [^] times the contact distance. We have 
to treat the residues glycine and proline as special cases, since the side chain of 
glycine contains only one H atom, while the N-H group in proline is replaced by 
N-CH 2 (Table [TJ. 

Hydrophobic interaction and hydrogen bonding are the dominant interac- 
tions governing the protein folding and maintaining the stability of the folded 
state. They are introduced in the energy function E in the form 

E = -g x K x - g 2 K 2 (14) 

where K\ is the total contact number of all hydrophobic residues, and K 2 the 
total number of internal hydrogen bonds. The parameters gi(i = 1,2) are 
constants, regarded in this model as universal for all proteins. 

The first term in E expresses the hydrophobic effect, which is proportional 
to the number of water molecules surrounding the hydrophobic side chains, i.e., 

(Maximum No.) — (Total hydrophobic contact No.). 

The maximum number contributes a constant to the total energy and can be 
omitted. The contact number of a residue is the number of atoms touching its 
side chain (not counting the two permanent nearest neighbors along the chain). 
This is illustrated in Figure^ a). 

Table 2: van der Waals radius (A) of the atoms 



N 


C 


R 


H 





1.55 


1.70 


1.95^ 


1.20 


1.50 



( a ) Refer the value for CH3, same for CH2 in proline. 



The contact number measures how well a hydrophobic residue is being 
shielded from the medium. When two hydrophobic residues are in contact, 
the total contact number increase by 2, and this induces an effective attraction 



between hydrophobic residues, as in the simple HP model (Lau & Dill, 1990 1 . 

The second term in E describes internal hydrogen bonding. As illustrated 
in Figure ^h) , an internal hydrogen bond is deemed to have formed between O 
and H from different residues when 

1. the distance between O and H is 2.3A, within given tolerance; and 

2. The bonds C=0 and N-H are antiparallel, within given tolerance. 



5 We use a = 0.6 in our studies. 
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main chain 



(a) (b) 

Figure 3: (a) The hydrophobic residue illustrated here has three contact neigh- 
bors. The permanent neighbors along the chain are not counted. Solid and 
dotted lines correspond to hard sphere and van der Waals radii, respectively, 
(b) Hydrogen bonding occurs between and H on the main chain, from different 
residues. 



Only the combinations gi/2ksT and <?2/2A;bT appear in the Metropolis 
method. They are treated as adjustable parameters. To study the folding pro- 
cess at different temperatures, we define a program temperature T* = ksT / g2, 
and use T* and g\ = g\j gi as independent parameters. 

Note that E only includes the potential energy. We leave out the kinetic 
energy because it contributes only a constant factor to the conformational prob- 
ability of the ensemble, provided the masses involved are constants independent 
of momenta. 

Because we are using the torsion angles as generalized coordinates, the 
canonically conjugate momenta are such that the effective masses depend on 
the coordinates, and thus not constants. However, independent studies have 
shown that the mass variations are generally less than 1%, and thus may be 



ignored (Leong et al. 20091 



5 CSAW results 

We summarize results from previous runs of CSAW. 
5.1 Test runs 

During the testing stage of CSAW, we run a minimal program as described in 



Huang (2007). For a chain of 30 residues, the main findings are the following: 
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Under the hydrophobic forces alone, without hydrogen bonding, the chain 
folds into a reproducible shape. This shows that the hydrophobic effect 
along can produce tertiary structure. There is no secondary structure 
in this case. The chain rapidly collapses to the final structure without 
passing through an intermediate state. 

When there is no hydrophobic force and the interaction consists purely of 
hydrogen bonding, the chain rapidly folds into one long alpha helix. When 
both hydrophobic force and hydrogen bonding are taken into account, 
secondary structure emerges. 

The folding process exhibits two-stage behavior, with a fast collapse fol- 
lowed by slow "annealing" , in qualitative agreement with experiments. 



5.2 Energy landscape 

As a more realistic example, we fold Chignolin, a synthetic peptide of 10 residues 



whose native state is a /?-hairpin (Huang 20071. From an ensemble of 100 



folding paths, starting with the random coil, we construct a view of the energy 
landscape, and exhibit the "folding funnel". We can also examine individual 
paths, which are quite different from one another. A common characteristic is 
that the path gets trapped in some pocket of states for a relatively long time, 
and suddenly break out, only to be trapped again — a behavior reminiscent of 
the Levy flight. 

5.3 Dynamics of helix formation 

To determine the CSAW parameters, we consider Polyalanine (Ala2o), a protein 
fragment of 20 identical amino acids alanine, which is hydrophobic. This protein 
has been studied by many authors, both in vitro and in silico, and the native 



state is known to be a single alpha helix ( Liu & Deber 1998 Smith & Hall 2001 



Shental-Bechor et al. 2005). In the CSAW model, the helix emerges from an 



initial random coil (Lei & Huang 2008). We run multiple trial simulations with 



different parameters g\ and T* in order to find the optimal values, with the 
results 

g\ = 0.05, T* = 0.2. (15) 

According to the philosophy of the model, this should be universal for all pro- 
teins. But, since only the hydrophobic interaction and hydrogen bonding are 
considered, we expect that there will be variations from protein to protein. 

The study reveals some dynamical aspects of helix formation. Starting from 
a random coil, helical structure is nucleated at two locations along the protein 
chain. The nuclei grow and eventually merge into the single helix that is the 
native state. We obtain rate constants by fitting simulation data to exponential 
functions. Using experimental information, we calibrate one Metropolis MC 
step in CSAW as 

1 MC step fa 10" 12 s. (16) 
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6 Elastic energy of proteins 



6.1 Flory analysis 

Stages in protein folding can be characterized by scaling relations of the form 



R g ~N v 



(17) 



between the radius of gyration R g and the residue number N. This exponent 
should be a universal feature independent of specific protein sequence. A com- 
bination of theoretical modeling and experimental data suggests that there are 
three stages with v — 3/5,3/7,2/5, which may be identified as the unfolded 
pre-globule, and molten globule stage, respectively ( Arteca| 1994| Uversky 2002 



Hong & Lei |2009| . We shall look into the existence and universality of these 
stages via CSAW simulations. But first, a theoretical orientation. 

The unfolded protein chain is akin to a homopolymer, for which the 3/5 
scaling law is known for long time from Flory's SAW (self-avoiding walk) model 



(Flory 19531. To derive the exponent, Flory models the free energy by 



Flory 



N 



bN 2 
-Rj 



(18) 



where a and b are temperature-dependent coefficients, and D is the spatial di- 
mension. The first term is the free energy of stretching, in the form of Hooke's 
law. It is assumed that R 2 g scales like N, because that is the behavior in random 
walk. The second term arises from the excluded- volume effect, and is propor- 



tional to N times the density. Minimizing the energy with respect to R g 



leads 



to v = 3/(D + 2), which give v = 3/5 for D = 3. 

The exponent 3/5 coincides with the Kolmogorov exponent in the energy 
spectrum of turbulence. This is no accident, for turbulence consists of a tangle 



of vortex lines, which can be modeled by SAW (Huang 2005 1. 

Hong and Lei ( |Hong fc Lei 2009| generalize Flory's stretching free energy 
by replacing R 2 /N with R*/NW a >-' ■ 



Hong— Lei 



7V( 2 /«)- 



bN 2 
If 



(19) 



where a is the fractal dimension of the conformation. Minimizing this free 
energy leads to v — (a + 2)/[a(D + 2)], which for D = 3 reduces to 

Flory's 3/5 law is recovered by setting the fractal dimension a = 1. 

For proteins in the native state, the fractal dimension is a — 2, as can be 
deduced from analysis of data in the PBD (Protein Data Bank). This leads 
to v = 2/5. At the free-energy minimum, the native-state free energy obeys 
Hooke's law, and scales as 



F n ~Ri 



iV 4 / 5 . 



(21) 
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These exponents will guide our analysis of data from CSAW simulations of 
protein folding. 



6.2 Elastic energy 

What we can directly calculate via CSAW is not the free energy but an "elastic 
energy" . 

We use CSAW to generate an ensemble of folding paths for a number of 
proteins, and examine the average potential energy as a function of R g (Lei & 



Huang 2009 1 . We analyze the simulation data in terms of an elastic potential 
energy, guided by the physical picture that the folding dynamics is dominated 
by the twin actions of the hydrophobic force and hydrogen bonding. 

Five proteins are chosen for simulation, with residue numbers ranging from 
N = 20 to 330 (Table|3]). For each protein, the simulation generates an ensemble 
of 1.2 x 10 4 conformations. 



Table 3: Proteins simulated. N = number of residues, h = fraction of hydropho- 
bic residues. 



Protein name 


ID 


N 


h 


Structure 


Polyalanine 


ala20 


20 


1.000 


1 alpha helix 


Antimicrobial LCI 


2b9k 


47 


0.383 


1 beta sheet 


Tedamistat 


3ait 


74 


0.351 


2 sheets 


Myoglobin 


lmbs 


153 


0.379 


8 helices 


Asparagine 
synthetase 


lias 


330 


0.397 


11 helices 
8 sheets 



We define an energy E(R g , N) as the ensemble average of the model potential 
energy ( 14 ) over all conformations that share a given value of R g (within a 
certain bin size). The gradient —dE/dR g gives the pressure-force experienced 
by the protein as a function of radius, which in principle can be observed in 



2005 Fernandez & Li 2004) 



force spectroscopy experiments (Cecconi et t 
In this sense, we call E(R g ,N) an "elastic energy". 

To plot the data, we scale the energy £ by a factor A 4 / 5 , as suggested 
by (21). Our earlier discussion indicate that the radius R g should scale like 
N", with v = 3/5 and v — 2/5 in different regions. Accordingly we rescale 
the data in two different manners, as shown in Figure [4] As we can see, the 
3/5 plot exhibits universal behavior in the unfolded region, while the 2/5 plot 
shows universality in the collapsed region. The protein Ala20 is exceptional, 
since it is completely hydrophobic. This exception in fact shows the relevance 
of hydrophobicity. 

From physical arguments, we suggest the following analytical form of the 
elastic energy]^] 



E(R g ,N) = aN 4 / 5 + biNRg) 1 ' 2 + c(p)N 2 /R\ 

'For detailed arguments see |Lei fc Huang||2009| l. 



(22) 
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Figure 4: Reseating the data to exhibit universality in the unfolded state (left) 
and collapsed state (right) of five proteins, listed with residue numbers in paren- 
thesis. The case Ala2o is exceptional, being completely hydrophobic. Replottcd 



from Lei & Huang (2009) 



Here, a and b are parameters, and c(p) represents an effective excluded volume, 
which depends on N and R g only through the scaled radius 



R g 

N 2 / 5 ' 



(23) 



We expect a and b to be universal coefficients, but not the effective excluded 
volume, as it depends on the intricate structure of a collapsed chain. 
The various terms have the following physical meaning: 

• The first term a N 4 ^ 5 merely establishes the zero point of energy as that of a 
completely extended chain — a convention used in the CSAW simulations. 

• The second term b(N Rg) 1 ' 2 represents the hydrophobic energy, with b 
depending on the hydrophobicity of the protein chain. 

• The term c(p)N 2 / R~ is a combination of excluded- volume energy and that 
from hydrogen bonding, and the relative importance of these contributions 
depends on the scaled radius p. 

Using a phenomenological approach, we expand c(p) in an inverse power 
series of its argument: 

c(p) = }c n p-'\ (24) 



ri>0 



Terms of higher n become increasingly important as the collapsed chained is 
being compressed, and c(p) can be positive or negative, depending on the signs 
of various short-distance interactions. 

We fit the parameters a, b, c n (n = 0, • • • , 12) to simulation data^] Figure [H] 
shows the fit of (22 1 to the five proteins used. The fit is poor for Ala2o because 



5 The non-zero parameters are: a = —1.50, b = 0.43, and for even n from to 12: c„ 
2"+ 3 x {2.53,-34.49,155.87,-326.50,350.09,-187.19,39.59}. 
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of the exceptional nature of the protein as remarked previously. In the case of 
lias, simulation has not yet reached an equilibrium ensemble. We did not use 
these two cases in fitting the parameters. 



u 






-20 






-40 










/ ala20(20) * 


-60 




2b9k (47) 






3ait (74) 


-80 




1mbs(153) 






11 as (330) * 


-100 




* Not used in fitting 


-120 
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Figure 5: Average potential energy vs. radius of gyration, from computer sim- 



ulations (points) and the universal elastic energy (22) (solid curves) of five 
proteins, listed with residue numbers in parenthesis, in the order of the curves 
from top to bottom. Replotted from Lei & Huang ( 2009 ) 



Thus, we have a "universal" elastic energy that should be 

• truly universal for large R g , when the short-distance interactions are not 
important, and 

• valid only on average at smaller R g , when short-distance interactions be- 
come important. 



7 Stages of protein folding 

We identify a stage by the scaling index v. Such a state is not necessarily stable 
or metastable, i.e., it does not always correspond to a local minimum of the 
clastic energy (22). But even an unstable state has kinetic meaning, and we 



can identify it through physical arguments, by imagining that certain effects 
are "turned off" . 



7.1 Unfolded stage 

In the unfolded stage the natural variable is R g /N 3 / 5 , as revealed by Figure [4] 
(left panel). In a thought experiment, we can render it stable by imagiing that 
the hydrophobic effect is turned off. imaging 



7.2 The pre-globule 

The pre-globule is characterized by v — 3/7. It may or may not be experimen- 
tally visible, depending on details of hydrogen bonding. By "turning off'higher- 
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order terms in c(/?), keeping only the n — term, we have the elastic energy 



Ei(R g , N) = aiV 4/5 + biNRg) 1 ' 2 + c N 2 /R 3 Q 



(25) 



This has a local minimum with scaling law R g ~ N 3 ^ 7 (Figure |6j), which defines 
the ideal pre-globule theoretically. 

Figure [6jb) shows the radius of the ideal pre-globule, which is expected to 
scale according to 



R g sa 4.12iV 3/7 (theoretical pre-globule). 



(26) 



This theoretical prediction is larger than observed data, because the hydrogen- 
bonding attractions have been ignored. By extending the powers in c(p) to 
n = 8, we find that the local minimum satisfies the scaling law 



R g w 3.61iV (observed pre-globule), 



(27) 



in good agreement with the pre-globule observed in some proteins (Figure |6| 
(|Uverskyl|2002). 



The higher powers in c (p) destabilize the pre-globule, due to the fact that 
the attractive contributions from hydrogen bonding may make c{p) change sign. 

"On average", therefore, the pre-globule is not even metastable, and the 
collapsed protein chain continues to shrink until it reaches the most compact 
state, with scaling law R g ~ TV 2 / 5 . 




Figure 6: (a). Elastic energy of lmbs from the potential function p2| with 
the powers in c(p) truncated at n = (dotted line), n = 8 (dashed line), and 
n = 12 (solid curve), respectively, (b). Radius of gyration for different stages. 
Theoretical pre-globule: R g = 4.127V 3 / 7 ; observed pre-globule: R g = 3.61iV 41 ; 
molten globule: R g = 3.017V 2/5 ; the most compact state: R g = 1.66iV 2/5 . 
Experimental data are also shown, the proteins in pre-globule (triangles) and 
molten globule (squares) from Tcherkasskaya & Uversky ( 2001[ ), the most com- 
pact proteins from Arteca (1995) (circles), and the native proteins from PDB 



(dots) (Hong & Lei 



2009). 
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7.3 The molten globule 



In the energy function (22 1 as plotted in Figure |6[a) (solid line), there is a flat 
shoulder, which is almost metastable. Between this shoulder and the lowest 
minimum, corresponding to the most compact state, there is a region that can 
be identified as the molten globule. We identify this region with a single stage 
because the scaling exponent is 2/5 throughout. It is bounded by the curves 
plotted in Figure |6^b): 

R g ps 3.017V 2 / 5 (be ginning of molten globule), , oa * 
R g ps 1.667V 2 / 5 (most compact state). 



As shown in Figure |6jb), data from the PDB show that the R g of all native 
proteins are distributed between these two curves. 



7.4 The native state 

The most compact state is the closest we can get to the native state in CSAW, 
since the detailed structures of the side chains have not been included. As 
Figure [6jb) shows, the native state in real proteins shares the same scaling 
index 2/5 with the molten globule. While the time it takes for a random coil 
to collapse into the molten globule is less than or of the order of microseconds, 
the development of the molten globule to the native state requires seconds or 
longer. This suggests that after the molten globule, the "snapping on" of side 
chains is governed by new mechanisms. 

7.5 Mechanisms of different stages 

In summary, as shown in Figure [2j 

• The unfolded state is characterized by an absence of both hydrophobic 
forces and hydrogen bonding. The dominant interaction is the excluded- 
volume effect, which makes the unfolded protein chain a SAW. 

• Under hydrophobic forces, the protein chain rapidly collapses into the 
pre-globule, which is maintained by a balance between the hydrophobic 
pressure and excluded- volume effect. The density is not high enough for 
significant hydrogen bonding to occur, and the chain can slide against 
itself. The lifetime of this state depends on the importance of hydrogen 
bonding. 

• Upon further compression, hydrogen bonding occurs, and the structure 
acquires rigidity. The protein is now essentially as compact as the native 
state. 

• The side chains "snap on" to native positions, and this process takes a 
macroscopically long time. 
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8 Phase transition in myoglobin 



The transition from pre-globule to molten globule shows up as a sudden increase 
in the rate of hydrogen bonding, in a process akin to a phase transition. We 
illustrate this in the case of myoglobin, as shown in Figure [7] with the number 
of hydrogen bonds function of time. 




Figure 7: Growth of hydrogen bonds in the simulation of myoglobin. The onset 
of the transition from pre-globule to molten globule is marked by a sudden jump 
in the growth rate. Replotted from Lei & Huang (20091 



9 A theoretical limit on protein size 

To describe the effective excluded-volume effect through a single variable de- 
pending only on the residue number TV, we define the "excluded volume" c as 
the average of c(p) over all physically possible values of p: 



r"" X c(p)dp, (29) 

Pmin J p min 



where p m in and p max are respectively the minimum and maximum values of the 
scaled radius p — R g /N 2 / b in the ensemble. 

We have p min — 1.66 according to the radius of the most compact state 



given by ( 28 1 . From Figure |6Jb) , when there is no hydrogen bonding, the chain 
collapses to a state with scaling law R g ps 4.127V 3 / 7 . This can serve as the 
maximal radius in the post-collapsed ensemble, and thus 

p max PS 4.12AT 1 / 35 . (30) 

Figure [8] shows c is a function of p max . 

From Figure [8j we see that c changes from negative to positive at p max ps 
6.85. When 4.12 < p max < 6.85, the excluded volume is negative, and therefore 
attractive interactions are dominant, which drive the chain to a compact state. 
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Figure 8: The excluded volume c as a function of p n 



When p max > 6.85, however, the excluded volume is positive, and repulsive 
interactions are dominant. In this case, the chain cannot fold to a more compact 
state, because not enough hydrogen bonds can be formed after the hydrophobic 
collapse. 



The value /? ma x = 6.85 and (30 1 yield a critical chain length 



iVcrit = 5.34 x 10 7 . (31) 

Thus, a super protein (N > N crit ) do not have a stable folded state. 

We emphasize here that the above conclusion is valid only for a natural 



protein, i.e., the fraction of hydrophobic residues is around 40% (Hong & Lei 



2009). It may not be the case for a fully hydrophobic chain. 



10 Discussion and outlook 

In this chapter, we discuss protein folding from the perspective of statistical 
physics. We treat the protein as a physical chain immersed in water, and tends 
towards thermodynamic equilibrium. The process is stochastic, and can be 
modeled with the master or the Langevin equation. We introduce the CSAW 
model (conditioned self-avoiding walk), which simulates the Langevin equation 
for the protein chain. With this model, we are able to study various aspects of 
protein folding. 

In treating protein folding as a physical process, the CSAW model differs 
from MD in two important aspects, namely 

• irrelevant degrees of freedom are ignored; and 

• the environment is treated as a stochastic medium. 

These, together with simplifying treatment of interactions, enable the model to 
produce qualitatively correct results with minimal demands on computer time. 
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An important simplification is the separation of the hydrophobic effect and 
hydrogen bonding, as expressed by the separate terms in the potential energy. 
Since both effects arise physically from hydrogen bonding, it is not obvious that 
we can make such a separation. The implicit assumption is that hydrogen bond- 
ing with water involves only the side chains, while internal hydrogen bonding 
involves only the backbone. This property is supported by statistical data, but 
should be a result rather an assumption of the model. We should try to remedy 
this in an improved version of the model. 

The CSAW model successfully folds polyalanine (Ala2o), a protein fragment 
of 20 identical amino acids alanine, from random coil to its native state, which is 



a single alpha helix (Lei & Huang 2008 ). We also use the CSAW model to study 



the elastic energy of proteins. The elastic energy gives rise to scaling relations 
of the form R g ~ N v in different stages of folding, with v = 3/5, 3/7, 2/5 for the 
unfolded stage, the pre-globule, and the molten globule, respectively. The the- 
oretical predictions agree well with experimental data (Figure [6]). These stages, 
together with mechanisms responsible for their formation, are summarized in 
Figure [2j 

These examples show that, despite its simplicity, CSAW incorporates im- 
portant physical principles governing protein folding. 

We can refine CSAW by adding refinements, including the following: 



• All-atom side chains with fractional hydrophobicity (Sun 2007). 

• Electrostatic interactions. 

• Replacement of hard-sphere repulsions by Lennard- Jones potentials. 

We hope to make progress on the protein folding problem with this theoret- 
ical laboratory. 
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